ai alignment
Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value
Edelman, Joe, Zhi-Xuan, Tan, Lowe, Ryan, Klingefjord, Oliver, Wang-Mascianica, Vincent, Franklin, Matija, Kearns, Ryan Othniel, Hain, Ellie, Sarkar, Atrisha, Bakker, Michiel, Barez, Fazl, Duvenaud, David, Foerster, Jakob, Gabriel, Iason, Gubbels, Joseph, Goodman, Bryce, Haupt, Andreas, Heitzig, Jobst, Jara-Ettinger, Julian, Kasirzadeh, Atoosa, Kirkpatrick, James Ravi, Koh, Andrew, Knox, W. Bradley, Koralus, Philipp, Lehman, Joel, Levine, Sydney, Marro, Samuele, Revel, Manon, Shorin, Toby, Sutherland, Morgan, Tessler, Michael Henry, Vendrov, Ivan, Wilken-Smith, James
Beneficial societal outcomes cannot be guaranteed by aligning individual AI systems with the intentions of their operators or users. Even an AI system that is perfectly aligned to the intentions of its operating organization can lead to bad outcomes if the goals of that organization are misaligned with those of other institutions and individuals. For this reason, we need full-stack alignment, the concurrent alignment of AI systems and the institutions that shape them with what people value. This can be done without imposing a particular vision of individual or collective flourishing. We argue that current approaches for representing values, such as utility functions, preference orderings, or unstructured text, struggle to address these and other issues effectively. They struggle to distinguish values from other signals, to support principled normative reasoning, and to model collective goods. We propose thick models of value will be needed. These structure the way values and norms are represented, enabling systems to distinguish enduring values from fleeting preferences, to model the social embedding of individual choices, and to reason normatively, applying values in new domains. We demonstrate this approach in five areas: AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, and democratic regulatory institutions.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (11 more...)
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.46)
- Leisure & Entertainment (1.00)
- Law (1.00)
- Government (1.00)
- (3 more...)
The Second Law of Intelligence: Controlling Ethical Entropy in Autonomous Systems
We propose that unconstrained artificial intelligence obeys a Second Law analogous to thermodynamics, where ethical entropy, defined as a measure of divergence from intended goals, increases spontaneously without continuous alignment work. For gradient-based optimizers, we define this entropy over a finite set of goals {g_i} as S = -Σ p(g_i; theta) ln p(g_i; theta), and we prove that its time derivative dS/dt >= 0, driven by exploration noise and specification gaming. We derive the critical stability boundary for alignment work as gamma_crit = (lambda_max / 2) ln N, where lambda_max is the dominant eigenvalue of the Fisher Information Matrix and N is the number of model parameters. Simulations validate this theory. A 7-billion-parameter model (N = 7 x 10^9) with lambda_max = 1.2 drifts from an initial entropy of 0.32 to 1.69 +/- 1.08 nats, while a system regularized with alignment work gamma = 20.4 (1.5 gamma_crit) maintains stability at 0.00 +/- 0.00 nats (p = 4.19 x 10^-17, n = 20 trials). This framework recasts AI alignment as a problem of continuous thermodynamic control, providing a quantitative foundation for maintaining the stability and safety of advanced autonomous systems.
- North America > United States > Maryland > Prince George's County > Laurel (0.04)
- North America > United States > Colorado (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.46)
Moral Change or Noise? On Problems of Aligning AI With Temporally Unstable Human Feedback
Keswani, Vijay, Cousins, Cyrus, Nguyen, Breanna, Conitzer, Vincent, Heidari, Hoda, Borg, Jana Schaich, Sinnott-Armstrong, Walter
Alignment methods in moral domains seek to elicit moral preferences of human stakeholders and incorporate them into AI. This presupposes moral preferences as static targets, but such preferences often evolve over time. Proper alignment of AI to dynamic human preferences should ideally account for "legitimate" changes to moral reasoning, while ignoring changes related to attention deficits, cognitive biases, or other arbitrary factors. However, common AI alignment approaches largely neglect temporal changes in preferences, posing serious challenges to proper alignment, especially in high-stakes applications of AI, e.g., in healthcare domains, where misalignment can jeopardize the trustworthiness of the system and yield serious individual and societal harms. This work investigates the extent to which people's moral preferences change over time, and the impact of such changes on AI alignment. Our study is grounded in the kidney allocation domain, where we elicit responses to pairwise comparisons of hypothetical kidney transplant patients from over 400 participants across 3-5 sessions. We find that, on average, participants change their response to the same scenario presented at different times around 6-20% of the time (exhibiting "response instability"). Additionally, we observe significant shifts in several participants' retrofitted decision-making models over time (capturing "model instability"). The predictive performance of simple AI models decreases as a function of both response and model instability. Moreover, predictive performance diminishes over time, highlighting the importance of accounting for temporal changes in preferences during training. These findings raise fundamental normative and technical challenges relevant to AI alignment, highlighting the need to better understand the object of alignment (what to align to) when user preferences change significantly over time.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Natural Language (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Towards Integrated Alignment
Reis, Ben Y., La Cava, William
As AI adoption expands across human society, the problem of aligning AI models to match human preferences remains a grand challenge. Currently, the AI alignment field is deeply divided between behavioral and representational approaches, resulting in narrowly aligned models that are more vulnerable to increasingly deceptive misalignment threats. In the face of this fragmentation, we propose an integrated vision for the future of the field. Drawing on related lessons from immunology and cybersecurity, we lay out a set of design principles for the development of Integrated Alignment frameworks that combine the complementary strengths of diverse alignment approaches through deep integration and adaptive coevolution. We highlight the importance of strategic diversity - deploying orthogonal alignment and misalignment detection approaches to avoid homogeneous pipelines that may be "doomed to success". We also recommend steps for greater unification of the AI alignment research field itself, through cross-collaboration, open model weights and shared community resources.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California (0.04)
- Research Report (1.00)
- Overview (1.00)
Whose Truth? Pluralistic Geo-Alignment for (Agentic) AI
Janowicz, Krzysztof, Liu, Zilong, Mai, Gengchen, Wang, Zhangyu, Majic, Ivan, Fortacz, Alexandra, McKenzie, Grant, Gao, Song
AI (super) alignment describes the challenge of ensuring (future) AI systems behave in accordance with societal norms and goals. While a quickly evolving literature is addressing biases and inequalities, the geographic variability of alignment remains underexplored. Simply put, what is considered appropriate, truthful, or legal can differ widely across regions due to cultural norms, political realities, and legislation. Alignment measures applied to AI/ML workflows can sometimes produce outcomes that diverge from statistical realities, such as text-to-image models depicting balanced gender ratios in company leadership despite existing imbalances. Crucially, some model outputs are globally acceptable, while others, e.g., questions about Kashmir, depend on knowing the user's location and their context. This geographic sensitivity is not new. For instance, Google Maps renders Kashmir's borders differently based on user location. What is new is the unprecedented scale and automation with which AI now mediates knowledge, expresses opinions, and represents geographic reality to millions of users worldwide, often with little transparency about how context is managed. As we approach Agentic AI, the need for spatio-temporally aware alignment, rather than one-size-fits-all approaches, is increasingly urgent. This paper reviews key geographic research problems, suggests topics for future work, and outlines methods for assessing alignment sensitivity.
- Europe > Austria > Vienna (0.16)
- North America > United States > Wisconsin > Dane County > Madison (0.14)
- North America > United States > Maine > Penobscot County > Orono (0.14)
- (6 more...)
- Research Report (1.00)
- Overview (0.88)
- Law (0.68)
- Information Technology (0.48)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
On the Inevitability of Left-Leaning Political Bias in Aligned Language Models
The guiding principle of AI alignment is to train large language models (LLMs) to be harmless, helpful, and honest (HHH). At the same time, there are mounting concerns that LLMs exhibit a left-wing political bias. Yet, the commitment to AI alignment cannot be harmonized with the latter critique. In this article, I argue that intelligent systems that are trained to be harmless and honest must necessarily exhibit left-wing political bias. Normative assumptions underlying alignment objectives inherently concur with progressive moral frameworks and left-wing principles, emphasizing harm avoidance, inclusivity, fairness, and empirical truthfulness. Conversely, right-wing ideologies often conflict with alignment guidelines. Yet, research on political bias in LLMs is consistently framing its insights about left-leaning tendencies as a risk, as problematic, or concerning. This way, researchers are actively arguing against AI alignment, tacitly fostering the violation of HHH principles.
- Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.05)
- South America > Brazil (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (4 more...)
- Questionnaire & Opinion Survey (0.46)
- Overview (0.46)
- Research Report (0.40)
- Government (1.00)
- Education (0.93)
- Health & Medicine > Therapeutic Area (0.48)
Disentangling AI Alignment: A Structured Taxonomy Beyond Safety and Ethics
Recent advances in AI research make it increasingly plausible that artificial agents with consequential real-world impact will soon operate beyond tightly controlled environments. Ensuring that these agents are not only safe but that they adhere to broader normative expectations is thus an urgent interdisciplinary challenge. Multiple fields -- notably AI Safety, AI Alignment, and Machine Ethics -- claim to contribute to this task. However, the conceptual boundaries and interrelations among these domains remain vague, leaving researchers without clear guidance in positioning their work. To address this meta-challenge, we develop a structured conceptual framework for understanding AI alignment. Rather than focusing solely on alignment goals, we introduce a taxonomy distinguishing the alignment aim (safety, ethicality, legality, etc.), scope (outcome vs. execution), and constituency (individual vs. collective). This structural approach reveals multiple legitimate alignment configurations, providing a foundation for practical and philosophical integration across domains, and clarifying what it might mean for an agent to be aligned all-things-considered.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (2 more...)
Preference Learning for AI Alignment: a Causal Perspective
Kobalczyk, Katarzyna, van der Schaar, Mihaela
Reward modelling from preference data is a crucial step in aligning large language models (LLMs) with human values, requiring robust generalisation to novel prompt-response pairs. In this work, we propose to frame this problem in a causal paradigm, providing the rich toolbox of causality to identify the persistent challenges, such as causal misidentification, preference heterogeneity, and confounding due to user-specific factors. Inheriting from the literature of causal inference, we identify key assumptions necessary for reliable generalisation and contrast them with common data collection practices. We illustrate failure modes of naive reward models and demonstrate how causally-inspired approaches can improve model robustness. Finally, we outline desiderata for future research and practices, advocating targeted interventions to address inherent limitations of observational data.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > Austria > Vienna (0.14)
- (12 more...)
- Research Report > New Finding (0.67)
- Research Report > Experimental Study (0.46)
Axioms for AI Alignment from Human Feedback
In the context of reinforcement learning from human feedback (RLHF), the reward function is generally derived from maximum likelihood estimation of a random utility model based on pairwise comparisons made by humans. The problem of learning a reward function is one of preference aggregation that, we argue, largely falls within the scope of social choice theory. From this perspective, we can evaluate different aggregation methods via established axioms, examining whether these methods meet or fail well-known standards. We demonstrate that both the Bradley-Terry-Luce Model and its broad generalizations fail to meet basic axioms. In response, we develop novel rules for learning reward functions with strong axiomatic guarantees.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.65)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.65)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.65)
An Affective-Taxis Hypothesis for Alignment and Interpretability
Sennesh, Eli, Ramstead, Maxwell
AI alignment is a field of research that aims to develop methods to ensure that agents always behave in a manner aligned with (i.e. consistently with) the goals and values of their human operators, no matter their level of capability. This paper proposes an affectivist approach to the alignment problem, re-framing the concepts of goals and values in terms of affective taxis, and explaining the emergence of affective valence by appealing to recent work in evolutionary-developmental and computational neuroscience. We review the state of the art and, building on this work, we propose a computational model of affect based on taxis navigation. We discuss evidence in a tractable model organism that our model reflects aspects of biological taxis navigation. We conclude with a discussion of the role of affective taxis in AI alignment.
- Europe > Austria > Vienna (0.14)
- North America > United States > Tennessee > Davidson County > Nashville (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.47)